(19) 



Europäisches Patentamt
European Patent Office
Office européen des brevets




(12) 



(43) Date of publication: 

02.01.1997 Bulletin 1997/01 



(11) EP 0 751 470 A1 

EUROPEAN PATENT APPLICATION 

(51) Int. Cl.6: G06F 17/30



(21) Application number: 96304778.2

(22) Date of filing: 28.06.1996 



(84) Designated Contracting States: 
DE FR GB 

(30) Priority: 28.06.1995 US 495865 

(71) Applicant: XEROX CORPORATION 
Rochester New York 14644 (US) 

(72) Inventors: 

• Kupiec, Julian M. 
Cupertino, California 95014 (US) 

• Pedersen, Jan O. 

Palo Alto, California 94303 (US) 



• Chen, Francine R. 

Menlo Park, California 94025 (US)

• Brotsky, Daniel C. 
Berkeley, California 94707 (US) 

• Putz, Steven B. 

Santa Clara, California 95051 (US) 

(74) Representative: Reynolds, Julian David et al 
Rank Xerox Ltd 
Patent Department 
Parkway
Marlow Buckinghamshire SL7 1YL (GB)



(54) Automatic method of generating feature probabilities for automatic extracting 
summarization 






(57) A method of automatically generating feature probabilities that allow later automatic generation of document extracts. The computer system generates the probabilities by analyzing the documents of a document corpus one at a time. First, the computer system designates one of the documents as a selected document. Next, the computer system analyzes each sentence of the selected document to determine the value of the paragraph feature and the value of the upper case feature. The computer system repeats this effort for each document of the document corpus. Afterward, the number of occurrences of each value of each feature is calculated and is used to calculate feature value probabilities for all of the features.
Description

The present invention relates to a method of automatic text processing. In particular, the present invention relates to an automatic method of generating feature probabilities that can be used later to automatically create summary extracts.

Summaries and extracts provide a concise document description more revealing than a document title, yet brief enough to be absorbed in a single glance. The desirability of summaries and extracts is increased by the large quantity of on-line, machine readable information currently available.

Traditional author-supplied indicative abstracts, when available, fulfill the need for a concise document description. The absence of author-supplied abstracts can be overcome with automatically generated document summaries. Numerous researchers have addressed automatic document summarization. The nominal task of generating a coherent narrative summarizing a document is currently considered too problematic because it encompasses discourse understanding, abstraction, and language generation. A simpler approach avoids the central difficulties of natural language understanding by defining document summarization as summary by extraction. That is to say, the goal of this approach is to find a subset of sentences of a document that are indicative of document content. Typically, under this approach document sentences are scored and the highest scoring sentences are selected for extraction.

Numerous heuristics have been proposed to score sentences for extracting summarization. Existing evidence suggests that combinations of features yield the best performance. At least one prior extracting summarizer uses multiple features, which are weighted manually by subjective estimation. Manually assigning feature weights to obtain optimal performance is difficult when many features are used.

Prior features used for extracting summarization include frequency-keyword heuristics, location heuristics, and cue words. Frequency-keyword heuristics use common content words as indicators of the document's theme. Location heuristics assume that important sentences lie at the beginning and end of a document, in the first and last sentences of paragraphs, and immediately below section headings. Cue words are words that are likely to accompany [...].

Another object of the present invention is to combine multiple features together in an extracting summarizer to provide better extracts than possible using just one feature.

A further object of the present invention is to provide an extracting summarizer whose performance can be [...].

A method of automatically generating feature probabilities for automatic extracting summarization using a computer system is provided in accordance with the invention. Given a feature set and a matched training corpus, the method involves determining two kinds of probabilities: the probability of observing a value of a particular feature in a sentence that is included in the summary, and the probability of that feature taking each of its possible values.

The method further involves training to obtain probabilities by analyzing the documents of the corpus a document at a time. First, one of the documents is designated as a selected document. Next, each sentence of the selected document is analyzed to determine the values of the paragraph and the upper case features. This is repeated for each document of the document corpus. Having evaluated the features of every sentence in the corpus, the number of occurrences of each value of each feature is calculated. These counts are then used to calculate feature value probabilities.

The invention provides a method according to claim 1 and 2 of the appended claims.

The present invention is described with reference to the accompanying drawings, in which similar references indicate similar elements.

Figure 1 illustrates a computer system for automatically extracting summary sentences from natural language documents;

Figure 2 is a flow diagram of a method of locating the start of text within a document;

Figure 3 is a flow diagram of a method of generating a thematic summary of a document using the computer system;

Figure 4 is a flow diagram of a method of identifying upper case sentences within a document;

Figure 5 is a flow diagram of a method of locating sentences within a document that match sentences within a manually generated summary for that document;

Figure 6 is a flow diagram of a method of generating feature probabilities given a corpus of training documents;

Figure 7 is a flow diagram of a method of automatically generating an extract for a machine readable representation of a natural language document using multiple features and feature probabilities;

Figure 8 shows a manually generated summary for a document;

Figure 9 shows relevant paragraphs of the document associated with the summary of Figure 8; and

Figure 10 shows the sentences automatically extracted from the document of Figure 9.

Figure 1 illustrates in block diagram form computer system 10, in which the present method is implemented. By generating feature probabilities the present method enables computer system 10 to extract from a machine readable [...]. The present method alters the operation [...] features and independent evaluation of those features [...].

Bayes' rule underlies both the present method and [...] summarization. According to Bayes' rule, the probability of a sentence $s$ being included in a summary $S$, given a set of $k$ sentence characteristics $F_j$, $j = 1, 2, \ldots, k$, can be expressed mathematically as:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$

Assuming statistical independence of the features, the probability of a sentence $s$ being included in a summary $S$ can be reformulated as:

$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\, P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

That is to say, the overall probability of a sentence $s$ being included in a summary $S$ is proportional to the product of the probabilities provided by each feature evaluated independently. The present method takes advantage of this fact to generate probabilities [...] value for a feature in a sentence [...] be described in detail herein.

I. Computer System for Automatic Extracting Summarization

Prior to a more detailed discussion of either training or extracting summarization, consider computer system 10. Computer system 10 includes monitor 12 for visually displaying information to a computer user. Computer system 10 also outputs information to the computer user via a printer. Computer system 10 provides the computer user multiple avenues to input data. Keyboard 14 allows the computer user to input data to computer system 10 [...]. By moving mouse 16 the computer user is able to move a pointer displayed on monitor 12. The computer user can also input information to computer system 10 by writing on [...]. Alternatively, the computer user can input data stored on a machine readable medium, such as a floppy disk, by inserting the disk into floppy disk drive 22. [...]

Processor 11 controls and coordinates the operations of computer system 10 to execute the commands of the computer user. Processor 11 determines and takes the appropriate action in response to each user command by executing instructions stored electronically in memory [...] or on a floppy disk within disk drive 22. Typically, operating instructions for processor 11 are stored in solid state memory, allowing frequent and rapid access to the instructions. Semiconductor memory devices that can be used include read only memories (ROM), random access memories (RAM), [...] programmable read only memories (PROM), [...] and electrically erasable programmable read only memories (EEPROM), such as flash memories.
II. Features

A. Feature Description 

Computer system 10 uses sentence characteristics, known as features, to automatically extract sentences likely
to be selected for inclusion in a manually generated summary. Preferably, computer system 10 uses five features to
generate document extracts, although a lesser or greater number may also be used. Preferably, the five features used
are: sentence length, cue words, sentence location, upper case sentences, and direct theme sentences. Performance
varies depending upon the combination of features used.

The sentence length feature indicates whether the number of words in a sentence meets or exceeds a minimum
length. The minimum length is selected to identify short sentences, like section headings, which are not often included
in manually generated summaries. In the preferred embodiment the minimum length required for the sentence length 
feature to be true is six words. Sentences of five or fewer words in length have a sentence length feature value of false 
in the preferred embodiment. 
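
By way of illustration only, the sentence length test might be sketched in Python as follows. The function and constant names are hypothetical, and counting whitespace-separated tokens as words is an assumption; the description leaves word segmentation to the tokenizer.

```python
MIN_WORDS = 6  # minimum length stated for the preferred embodiment


def sentence_length_feature(sentence: str, min_words: int = MIN_WORDS) -> bool:
    """Return True when the sentence has at least `min_words` words."""
    # Assumption: words are whitespace-separated tokens; the description
    # does not say how the tokenizer delimits words.
    return len(sentence.split()) >= min_words
```

Under this sketch a short section heading such as "II. Features" yields a feature value of false, which matches the stated purpose of the minimum length.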

The direct theme feature indicates whether a sentence addresses one of the main themes of a document. The 
direct theme feature uses the intuition that content words frequently used within a document are likely to be indicative 
of that document's theme. A method of identifying such sentences will be described in detail below. The value of the
direct theme feature indicates whether a sentence is one of the document's direct theme sentences. A sentence that
has been identified as a direct theme sentence will have a direct theme feature value of true. Analogously, a sentence
that has not been identified as a direct theme sentence will have a direct theme feature value of false.

The upper case feature indicates whether a sentence includes important proper names or acronyms, which are
frequently included within manually generated summaries. The feature is so named because proper names and acronyms
are typically presented using upper case letters, regardless of their position within a sentence. A method of
identifying upper case feature sentences will be described in detail below. A sentence that has been identified as an
upper case sentence will have an upper case feature value of true. Analogously, a sentence that has not been identified 
as an upper case sentence will have an upper case feature value of false. 

The cue word feature indicates whether a sentence includes word sequences that indicate it summarizes the
document. Such word sequences include:

this article, the article, this investigation, present investigation, this paper, this study, this work, present work, this
letter, in conclusion, is concluded, conclude that, we conclude, in summary, the results, our results, results show, results
indicate, results are.

This list of cue words is not intended to be exhaustive. Other word sequences may indicate that a sentence summarizes 
document content and may be used in conjunction with the methods described herein. 

Methods of identifying sentences including cue words will not be described in detail herein because a method for 
doing so will be obvious to those of ordinary skill. Sentences including cue words have a cue word feature value of 
true, and those not including cue words have a false value. 
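
Although the description leaves cue word detection to the skilled reader, one possible approach is sketched below in Python. The phrase list is the one given above; the function name, the lower-casing, and the whole-word matching on a punctuation-stripped sentence are assumptions made for the sketch, not part of the described method.

```python
import re

# Cue word sequences listed in the description (not exhaustive).
CUE_PHRASES = (
    "this article", "the article", "this investigation",
    "present investigation", "this paper", "this study", "this work",
    "present work", "this letter", "in conclusion", "is concluded",
    "conclude that", "we conclude", "in summary", "the results",
    "our results", "results show", "results indicate", "results are",
)


def cue_word_feature(sentence: str, cue_phrases=CUE_PHRASES) -> bool:
    """Return True when the sentence contains any cue word sequence."""
    # Assumption: matching ignores case and punctuation and respects
    # word boundaries, so "within summary" does not match "in summary".
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    normalized = " " + " ".join(words) + " "
    return any(f" {phrase} " in normalized for phrase in cue_phrases)
```

Other matching strategies, such as stemming or a trie of phrases, would serve equally well; the sketch only shows that the feature reduces to a true or false value per sentence.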

The location feature indicates whether the location of a sentence within a document is such that it is likely to be
included in a summary. Sentences located at the beginnings and ends of paragraphs are more likely to be included in
a manually generated summary than sentences in the middle of a paragraph. Further, sentences at the beginning or 
end of a document are more likely to be included in a short summary than sentences in the middle of a document. In 
the preferred embodiment, the beginning of a document is defined as the first five paragraphs after start of text, the 
end is defined as the last five paragraphs of a document, and the middle includes all other paragraphs. Additionally,
the beginning of a paragraph is defined as the first sentence, the end as the last sentence of the paragraph, and the 
middle includes all other sentences within a paragraph. Thus, unlike the other features used, the location feature can 
take more than two values. 
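
The location definitions above might be sketched in Python as follows. This is an illustration under stated assumptions: the function names are hypothetical, paragraph indices are taken to be zero-based and counted from the start of text, the two positions are returned as a pair, and the beginning takes precedence when a short document or one-sentence paragraph makes the regions overlap. The description does not specify any of these choices.

```python
def document_part(par_index: int, n_paragraphs: int, span: int = 5) -> str:
    """Classify a paragraph as in the beginning, middle, or end of the text."""
    # Assumption: "beginning" wins when a document has fewer than 2 * span
    # paragraphs and the two regions overlap.
    if par_index < span:
        return "beginning"
    if par_index >= n_paragraphs - span:
        return "end"
    return "middle"


def paragraph_part(sent_index: int, n_sentences: int) -> str:
    """Classify a sentence as first, last, or interior within its paragraph."""
    # Assumption: the only sentence of a one-sentence paragraph counts
    # as the beginning.
    if sent_index == 0:
        return "beginning"
    if sent_index == n_sentences - 1:
        return "end"
    return "middle"


def location_feature(par_index, n_paragraphs, sent_index, n_sentences):
    """Return (position within document, position within paragraph)."""
    return (document_part(par_index, n_paragraphs),
            paragraph_part(sent_index, n_sentences))
```

Returning a pair is one way of letting the feature take more than two values; an implementation could equally collapse the pair into a single enumerated value.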

B. Feature Evaluation 

The tokenizer used during training and extracting summarization facilitates evaluation of the features described
above. A tokenizer analyzes the machine readable representation of a natural language document and identifies paragraph
boundaries, sentence boundaries, and the words within each sentence. Preferably, the tokenizer generates a
sentence structure for each sentence of a document that includes three pieces of information useful to feature evaluation:
a sentence I.D., sentence position, and sentence length. The sentence I.D. is a unique number indicating the location
of a sentence with respect to the start of the document. The sentence position indicates the position of the sentence
within its paragraph. The sentence length represents the number of words included in the sentence, which facilitates
quick evaluation of the sentence length feature.
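
A minimal sketch of such a sentence structure, in Python, is given below for illustration. The class and function names are hypothetical, and the segmentation rules (blank lines separate paragraphs, sentences end at ".", "!" or "?", words are whitespace-separated) are simplifying assumptions; the description does not disclose the tokenizer's rules.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceRecord:
    sentence_id: int  # location relative to the start of the document
    position: int     # position of the sentence within its paragraph
    length: int       # number of words in the sentence
    text: str


def tokenize(document: str) -> List[SentenceRecord]:
    """Split a document into paragraphs and sentences and record each one."""
    records: List[SentenceRecord] = []
    next_id = 0
    # Assumption: paragraphs are separated by blank lines.
    for paragraph in re.split(r"\n\s*\n", document.strip()):
        # Assumption: a sentence ends at '.', '!' or '?' followed by space.
        parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
        for position, sentence in enumerate(s for s in parts if s):
            records.append(SentenceRecord(
                next_id, position, len(sentence.split()), sentence))
            next_id += 1
    return records
```

With the length stored in each record, the sentence length feature reduces to a single comparison per sentence.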

Methods of evaluating the selected set of features will be discussed below a feature at a time. There is no reason
feature evaluation need be done a feature at a time, however, and approaches for evaluating multiple features at a time
will be discussed below.



B.1. Evaluation of the Location Feature

Evaluation of the location feature is straightforward if a sentence's location within the main body of text is known. [...] To investigate the possibility further, processor 11 [...] the selected sentence and next [...] the main text body. In that [...] to step 34 and increments the sentence counter. Processor 11 determines during step 35 whether it has discovered [...] processor 11 has not yet identified the first paragraph. To determine whether further evaluation of the current paragraph is possible, processor 11 advances to step 36 [...]. During step 37 processor 11 identifies as the [...] the first paragraph of the main text body [...] I.D. is two less than that of the selected sentence. [...] the main body of text can be easily determined [...].

[...] means that processor 11 has not yet discovered [...]. Processor 11 determines whether it can continue searching for that first paragraph by asking whether all sentences of the selected document have been examined. The response of processor 11 [...] differs between the two steps, steps 33 and 36 [...]. Processor 11 branches to step 33 whenever the selected sentence [...] paragraph of the main text body, for whatever reason. [...] If processor 11 determines during step 33 [...], processor 11 advances to step 38. [...] sentences with terminating punctuation [...] as the first sentence of the first paragraph of [...].

B.2. Evaluation of the Direct Theme Feature

Figure 3 illustrates in flow diagram form instructions 40 executed by processor 11 to evaluate the direct theme feature for each sentence of a document, regardless of whether direct theme evaluation is performed during training or extracting summarization. Instructions 40 may be stored in machine readable form in solid state memory 25 or on a floppy disk placed within floppy disk drive 22. Instructions 40 may be realized in any computer language, including LISP and C++.

Briefly described, processor 11 begins its selection of direct theme sentences by first generating a list of terms used in the document, excluding stop words, and counting the number of times each word is used. This task is accomplished during steps 42, 43, 44, 45, 46, and 47. Processor 11 then uses the term list to identify the most frequent and longest terms, called thematic terms [...]. During steps 56, 58, and 60 [...]. Processor 11 selects a subset of the [...].

Given that brief description, let us now consider instructions 40 in detail. [...] the computer user may [...] the number, denoted "Z", of sentences selected as direct theme sentences [...] set to any arbitrary number of sentences. [...] by branching to step 42.

Processor 11 responds to the input [...] document by selecting a word from the document [...]. [...] abbreviations, determiners, and conjugations of the [...] are stop words. Thus, for example, English words such as "and" [...] are stop words. Stop words within the document are identified by comparing the words of the document to a list of stop words. If the selected word is a stop word, processor 11 [...]. When the selected word is not a stop word, processor 11 branches from step 43 to step 44.

During step 44 processor 11 compares the selected word to those already included in a term index, a data structure associating words of the document with [...] occurrence of that term. If the selected word is [...] branches [...] to step 46. During step 46 processor 11 adds an entry to the term index for the selected word [...] step 47. Processor 11 then determines whether all words in the document have been examined. If not, processor 11 has not completed the term index. In response, processor 11 [...] in the manner described. On the other hand, if every word of the document has been examined, then the term index is complete and processor 11 can turn its attention to [...]. [...] processor 11 determines the number [...]

[...] sentences. Preferably, $K$ is determined according to the equation:

$$K = \begin{cases} Z \times c_1 & \text{if } Z \times c_1 > 3 \\ 3 & \text{otherwise} \end{cases}$$

where:

$c_1$ is a constant whose value is less than 1;

$Z$ is the number of sentences in the thematic summary; and

$K$ is the number of thematic terms.
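
For illustration, the term index and the choice of thematic terms might be sketched in Python as follows. Several details are assumptions rather than disclosed specifics: the stop word list is a tiny placeholder, the product $Z \times c_1$ is truncated to an integer, ties in count are broken in favor of longer terms (suggested by the reference to the "most frequent and longest terms") and then alphabetically as a stand-in for an arbitrary choice, and the value 0.5 used for `c1` in the usage note is purely illustrative. All names are hypothetical.

```python
from collections import Counter

# Placeholder stop word list; the actual list is not given in the text.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "were"})


def number_of_thematic_terms(z: int, c1: float) -> int:
    """K = Z * c1 if Z * c1 > 3, otherwise 3."""
    # Assumption: a fractional product is truncated to an integer.
    return int(z * c1) if z * c1 > 3 else 3


def thematic_terms(words, z, c1, stop_words=STOP_WORDS):
    """Count non-stop words and return the K highest ranked terms."""
    counts = Counter(w.lower() for w in words if w.lower() not in stop_words)
    k = number_of_thematic_terms(z, c1)
    # Rank by count, then by length (assumed tie-break), then alphabetically.
    ranked = sorted(counts, key=lambda t: (-counts[t], -len(t), t))
    return ranked[:k]
```

For example, `thematic_terms("the cat and the dog saw the cat chase a bird and a cat".split(), z=4, c1=0.5)` selects three terms because $4 \times 0.5$ does not exceed 3.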
In one embodiment, the value of $c_1$ is set equal to [...].

Armed with a value for $K$ and the term index, processor 11 begins the process of selecting $K$ thematic terms. During step 50 processor 11 sorts the terms of the term index according to their counts; i.e., the total number of occurrences of each term in the document. Ties between terms having the same count are preferably broken in favor of the [...], and if that fails, arbitrarily. Having [...] processor 11 advances to step 54. [...] occurrences of the $K$ thematic terms in the document. [...] step 56 from step 54.

Having selected the thematic terms and [...], processor 11 is ready to begin evaluating the thematic content of the sentences of the document. [...] processor 11 considers only those sentences that include at least one of the $K$ thematic terms, which is easily done given the information included in the term index. Processor 11 does so by examining [...] sorted term index. After selecting [...] associated with [...] during step 58. For [...] score. Preferably, the score for each sentence is incremented by $\delta$, where $\delta$ is expressed by the equation:

$$\delta = \mathrm{count}_t \, (c_2 + \mathrm{freq}_t)$$

where:

$\mathrm{count}_t$ is the number of occurrences of $t$ in the sentence;

$c_2$ is a constant having a non-zero, positive value; and

$\mathrm{freq}_t$ is the frequency of the selected term.

$\mathrm{freq}_t$ is given by the expression: [...]

where: [...] occurrences of thematic terms within the document.

Sentence scores can be tracked by generating [...] includes that sentence I.D. If not, [...] list already includes the particular sentence I.D., [...] through steps 56, 58, and 60 [...]. When that event occurs, processor 11 branches [...]. Processor 11 then selects as the thematic sentences the Z sentences with the highest scores. Processor 11 does this by sorting the sentence scores [...]. Having selected the thematic sentences, processor 11 branches to step 63. During step 63 processor 11 [...] for all other sentences within the document to false.

B.3. Evaluation of the Upper Case Feature

[...] placed within floppy disk drive 22. [...]

For these sentences the upper case feature will be [...]; [...] document will be set to false. [...] upper case features of a document with step [...].

[...] associated with acronyms or proper [...] step 84. During that step processor 11 [...] its ASCII representation. [...] an upper case letter, there is the possibility that the selected word is a proper name or acronym. Processor 11 [...] this possibility by [...]. Processor 11 determines whether the selected word is the first word of the sentence. [...]

On the other hand, if the [...] upper case list. [...] processor 11 ranks the words in the upper case list [...] is ready to begin scoring each sentence [...] count of the selected upper case word [...] total upper [...]. [...] false the upper case [...].
B.4. BeducinaFeaU^^ p3,,,,ed in a can be evaluated 

given iust a ^^^J^";^^,^^^^^^^^ of ordinary skill .n f^^'^^ted simply 9'V en a sentence- Bo^^^ 

similarity between steps 4^-^ 

■ rrnliir" P^"habiHt>es or^ftraininq documents 

Tkluation o1 the ^--'^^^^f::Z^Ze^o.- training beg.ns. 
associated document. This mu 



A. Matching Summary Sentences

[...] executed by processor 11 to [...] matching sentences [...]. Instructions 200 may be stored in machine readable form in solid state memory 25 or on a floppy disk placed within floppy disk drive 22. Instructions 200 may be realized in any computer language, including LISP and C++.

Briefly described, instructions 200 identify document sentences that may match manually generated summary sentences, a summary sentence at a time. Processor 11 scores each document sentence of the selected document with respect to the selected summary sentence [...] based upon commonality of words, similar word order, and similar capitalization. Afterward, processor 11 identifies as possible matches for the selected summary sentence a subset of the highest scoring document sentences.
Given that brief description, [...]. Execution of instructions 200 is initiated by identification of the corpus of training documents and their manually generated summaries, all of which are in machine readable form. [...] processor 11 selects a summary. Afterward, during step 204 processor 11 selects a summary sentence [...].

During step 208 processor 11 initializes [...] to step 210 from step 208. [...] the words of the selected summary sentence and designates [...] a selected summary word [...] deferred until later. [...] to step 214 from step 212. Processor 11 will increase [...] is the first occurrence of the selected [...].

[...] capitalization of the selected summary word [...] document sentence.

During step 234 processor 11 considers word order as an indicator of similarity [...]. Processor 11 determines whether the selected [...] in the selected summary sentence [...] word in the selected document sentence. [...]

[...] In this example, [...] does not precede [...] as it does in the selected summary sentence. When [...] selected summary word does not precede [...] document sentence, processor 11 exits step 234 [...] for having satisfied the word order test. [...]

Processor 11 considers increases unjustified if the selected summary word is a stop word, because stop words are not [...] common to the selected summary sentence and the selected document sentence. Upon [...] the selected summary word is a stop word, processor 11 exits step 238 and advances to step 248. Discussion of [...] during step 248 will be briefly deferred. Processor 11 considers further increases to [...] when the selected summary word is not a stop word. In this situation processor 11 advances to step 240 from step 238 to determine how great an increase [...].

[...] Processor 11 determines [...] whether it has completed scoring [...] step 206 and designates another original sentence as [...].

Having selected the subset of possible matches [...]. During step 260 processor 11 determines whether it has selected matching original sentences [...] of the summary. If not, processor 11 returns to step 204 to begin the process of identifying matching sentences for another summary sentence [...]. [...] processor 11 advances to step 262. [...] processor 11 branches back to step 202 and begins the process [...] for another document-summary pair of the training corpus. However, if matches for the entire corpus have been identified, processor 11 exits step 262 [...], its current task complete [...] instructions 200.

Given identification of multiple possible matching sentences [...] preferably are manually selected. [...]
Given identification of multiple pos. „„*erablv are. manually selected, l-.nai m ^hich no 
B. Training to Generate Feature Probabilities



[...] of the features for all [...]. [...] processor 11 selects a feature to evaluate, the selected feature F, [...] for all sentences within the selected document. [...] processor 11 determines the number of sentences within the selected document in which F is true [...]. These quantities are denoted generally by "DocN_FV" [...] generally as TotalDocN_FV. [...]

During step 310 processor 11 determines the number of times the selected feature takes each of its possible values [...] matching original sentences of [...]. The counts for this particular document are then added to [...] TotalMatchN_FV. Evaluation complete, processor 11 exits step 310 and advances to step 312.

Processor 11 determines during step 312 whether it has determined all feature values for the selected document. If not, processor 11 [...] then executes steps [...] until [...] features have been determined. When evaluation of the selected document is complete, processor 11 [...] branches to step 314. [...] every document within the corpus [...] summaries.
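
The counting described above can be illustrated with the following Python sketch. It assumes that each training sentence has already been reduced to a dictionary of feature values together with a flag saying whether it matches a manual summary sentence, and that probabilities are estimated as simple relative frequencies of the accumulated counts; the function name, the input layout, and the absence of smoothing are assumptions, not disclosed details.

```python
from collections import Counter, defaultdict


def train(corpus):
    """Estimate P(F=v), P(F=v | s in S) and P(s in S) by counting.

    `corpus` is an iterable of documents; each document is a list of
    (features, in_summary) pairs, where `features` maps a feature name
    to its value for one sentence (hypothetical input layout).
    """
    total_doc = defaultdict(Counter)    # counts over all sentences
    total_match = defaultdict(Counter)  # counts over matching sentences
    n_sentences = n_matches = 0
    for document in corpus:
        for features, in_summary in document:
            n_sentences += 1
            if in_summary:
                n_matches += 1
            for name, value in features.items():
                total_doc[name][value] += 1
                if in_summary:
                    total_match[name][value] += 1
    if n_sentences == 0 or n_matches == 0:
        raise ValueError("corpus needs sentences and at least one match")
    # Assumption: probabilities are plain relative frequencies.
    p_value = {f: {v: c / n_sentences for v, c in vals.items()}
               for f, vals in total_doc.items()}
    p_value_given_summary = {f: {v: total_match[f][v] / n_matches for v in vals}
                             for f, vals in total_doc.items()}
    return p_value, p_value_given_summary, n_matches / n_sentences
```

The two count tables play the roles of the per-value totals over all sentences and over matching sentences; only the final division turns them into probabilities.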

IV. Method of Automatically Extracting Summary Sentences

Figure 7 illustrates in flow diagram form instructions 350 used by processor 11 to automatically extract [...]. Instructions 350 may be stored in machine readable form in [...]. [...] processor 11 [...] extracts the highest scoring sentences and presents them to the computer user.

Given that brief description, now consider Figure 7 in detail. [...] the computer user indicates a document in machine readable form for which an extract is to be generated. In response to selection [...] the selected document. [...] preferably one. Processor 11 then [...].

With step 356 processor 11 begins [...] the selected sentence [...] in an extract of the selected document. Processor 11 begins by selecting a feature for evaluation from the set of features during step 356. Processor 11 then determines the value of the feature for the selected sentence during step 358. Processor 11 then proceeds to step 360, where processor 11 looks up the probability associated with it. Next, during step 362, processor 11 modifies the score for the selected sentence by an amount proportional to the probability [...]. Preferably, during step 362 processor 11 simply multiplies the score for the selected sentence by the probability identified during step 360.

Having completed the evaluation of one feature, processor 11 determines during step 364 whether all the values of all features for the selected sentence have been determined. If not, processor 11 has not completed its scoring of the selected sentence. In that case, processor 11 returns to step 356 from step 364 and executes steps 356, 358, 360, 362, and 364 until the score of the selected sentence has been adjusted to reflect the values of all features. When processor 11 completes the scoring of the selected sentence, processor 11 exits step 364 and branches to step 366.

During step 366 processor 11 [...]. Having completed [...] all sentences, processor 11 advances to step [...]. Processor 11 selects a subset of the highest scoring sentences to [...] the document extract during step 370. In the preferred embodiment, the number of sentences [...] can be adjusted by the user from a default value. [...] Processor 11 may present the extract to the computer user [...]. [...] are presented in order of their occurrence [...].
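
The scoring and selection just described can be sketched in Python as follows. The description says only that the score is multiplied by the probability looked up for each feature value; the sketch assumes that this factor is the ratio $P(F_j \mid s \in S) / P(F_j)$ from the independence formula, that every score starts at one, and that ties are broken in favor of earlier sentences. The function names and the probability tables in the usage note are hypothetical.

```python
def score_sentence(features, p_value, p_value_given_summary):
    """Multiply an initial score of one by a factor for each feature value."""
    score = 1.0
    for name, value in features.items():
        # Assumed factor: P(F=v | s in S) / P(F=v). A value never seen in
        # training raises KeyError or ZeroDivisionError in this sketch.
        score *= p_value_given_summary[name][value] / p_value[name][value]
    return score


def extract(sentences, p_value, p_value_given_summary, n):
    """Return the n highest scoring sentences in order of occurrence.

    `sentences` is a list of (text, features) pairs in document order.
    """
    scored = [(score_sentence(f, p_value, p_value_given_summary), i)
              for i, (_, f) in enumerate(sentences)]
    # Highest score first; earlier sentences win ties (assumption).
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:n]
    return [sentences[i][0] for i in sorted(i for _, i in top)]
```

For instance, with `p_value = {"cue": {True: 0.25, False: 0.75}}` and `p_value_given_summary = {"cue": {True: 0.5, False: 0.5}}`, a sentence containing a cue phrase receives a factor of 2 and one without receives a factor of two thirds.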

autorti™rSnratrr.rrru?er^^^^^^ 

•Tg»a:d'lOgUsome.dlcat.nof,h.p_e.^ 

^p;ru;ri^™srerr^=^^^^^ 

V. Evaluation of the Method of Extracting Summary Sentences

The methods just described were applied to a training corpus [...].

The EI corpus included 188 document-summary pairs [...]. Each summary was created by a professional [...].

The size of the EI corpus did not permit reservation of a separate test corpus for evaluation. For this reason a cross-validation strategy was used to evaluate the performance of the extracting summarizer just described. Documents [...] were selected for testing one at a time; all other documents were used for training. Results [...]. Unmatchable and incomplete sentences were excluded from both training and testing.
Performance was evaluated two ways: 

1. The fraction of manual summary sentences faithfully reproduced; and

2. The fraction of manual summary sentences correctly identified by the summarizer.

The fraction of manual summary sentences faithfully reproduced is a stringent measure of summarizer performance
because it is limited by the sum of all direct matches and all direct joins. For the EI corpus the maximum obtainable by
this measure was 83%. Given the assumption that there is only one correct match for each manual summary sentence,
the extracting summarizer faithfully reproduced 35% of the manual summary sentences when given the number of
[illegible].

The second measure of summarizer performance, the fraction of summary sentences correctly identified, can theoretically
reach 100%. [illegible] 42% of the document sentences extracted using the methods described
[illegible]
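As a rough illustration of the second measure, the fraction of extracted sentences that are correct could be computed as below. This sketch is a simplification and an assumption: it treats each extracted sentence as simply correct or incorrect by membership in a set of matching sentence indices, and it does not model direct joins. The names `fraction_correct`, `extracted`, and `matching` are hypothetical.

```python
from typing import Iterable, Set


def fraction_correct(extracted: Iterable[int], matching: Set[int]) -> float:
    """Fraction of extracted sentence indices that appear in the matching set."""
    extracted = list(extracted)
    if not extracted:
        # No sentences were extracted, so nothing was identified correctly.
        return 0.0
    hits = sum(1 for i in extracted if i in matching)
    return hits / len(extracted)
```

Under this simplified definition, a summarizer that extracts only matching sentences scores 1.0, which is why the measure can in principle reach 100%.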

[illegible] EI corpus, the combination of the location, cue word and sentence length features yielded [illegible]
abstracts of the EI corpus.
VI. Conclusion

A method of automatically generating feature probabilities using a computer [illegible] is calculated and is used to calculate feature value probabilities for all of the features of the feature set.



Claims 
1. A method of automatically generating feature probabilities from a document corpus, each document including a
multiplicity of sentences, the method comprising the steps of:

a) designating as a selected document a document of the document corpus; 
b) designating as a selected sentence a one of the sentences of the selected document;

c) determining a value of a location feature for the selected sentence, the location feature having a first location
value, a second location value, and a third location value, the first location value indicating that the selected
sentence is included within a beginning portion of the selected document, the second [illegible]

g) if all sentences of the selected document have not been designated as the selected sentence, repeating [illegible]

h) if all documents of the document corpus have not been designated as the selected document, repeating
steps a) through g);

i) determining probabilities for each value of the location feature using the associated counter for each location 
feature value; 

j) determining the probabilities for each value of the upper case feature using the associated counter for each 
upper case feature value; and 

k) generating an extract for a first document presented in machine readable form to the user using the upper 
case feature, the location feature and the probabilities for each value of the upper case feature and the location 
feature. 

2. A method of automatically generating feature probabilities from a document corpus and a summary corpus of
model summaries, each document of the document corpus being associated with a summary of the summary
corpus, each document including a multiplicity of sentences, the multiplicity of sentences including a plurality of
matching sentences, each matching sentence matching a sentence of the associated summary, the method
comprising the steps of:

a) designating as a selected document a document of the document corpus; 

b) designating as a selected sentence a one of the sentences of the selected document; 

c) determining values for the selected sentence of each feature of a feature set, the feature set including a 
location feature and an upper case feature, the location feature having a first location value, a second location
value, and a third location value, the first location value indicating that the selected sentence is included within
a beginning portion of the selected document, the second location value indicating that the selected sentence
is included within a middle portion of the selected document, and the third location value indicating that the 
selected sentence is included within an ending portion of the selected document, each value of the location 
feature having an associated total counter, and an associated matching counter, the upper case feature having 
a first upper case value and a second upper case value, the first upper case value indicating that selected 
sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value 
indicating the selected sentence includes a one of the selected upper case phrases, each value of the upper 
case feature having an associated total counter and an associated matching counter; 

d) for each feature incrementing the total counter associated with the feature value for the selected sentence; 

e) if the selected sentence is a one of the plurality of matching sentences, for each feature incrementing the 
matching counter associated with the feature value for the selected sentence; 

f) if all sentences of the selected document have not been designated as the selected sentence, repeating 
steps b) through e); 

g) if all documents of the document corpus have not been designated as the selected document, repeating 
steps a) through f);

h) for each value of each feature determining a probability using the associated total counter and the associated
matching counter; and

i) generating an extract for a first document presented in machine readable form to the user using the feature 
set and the probabilities for each value of each feature. 
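The counting loop of steps a) through h) can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not say how the probability is derived from the two counters, so the ratio of matching counter to total counter used here is only one possible estimate. The names `count_features`, `probabilities`, and `location`, and the division of a document into equal thirds, are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Set, Tuple

# A document is a list of sentences, paired with the indices of its
# matching sentences (those matching a sentence of the model summary).
Document = List[str]
Feature = Callable[[Document, int], Hashable]


def location(document: Document, idx: int) -> str:
    """Hypothetical location feature: split the document into equal thirds."""
    third = len(document) / 3
    if idx < third:
        return "beginning"
    if idx < 2 * third:
        return "middle"
    return "ending"


def count_features(corpus: List[Tuple[Document, Set[int]]],
                   features: Dict[str, Feature]):
    total = defaultdict(int)     # (feature name, value) -> all sentences
    matching = defaultdict(int)  # (feature name, value) -> matching sentences
    for document, matches in corpus:                # steps a) and g)
        for idx in range(len(document)):            # steps b) and f)
            for name, feature in features.items():
                value = feature(document, idx)      # step c)
                total[(name, value)] += 1           # step d)
                if idx in matches:
                    matching[(name, value)] += 1    # step e)
    return total, matching


def probabilities(total, matching):
    # Step h), as an assumed estimate: the fraction of sentences having a
    # given feature value that matched a summary sentence.
    return {key: matching.get(key, 0) / count for key, count in total.items()}
```

Keeping one total counter and one matching counter per feature value means the corpus needs to be read only once, with all probabilities computed afterward from the counters.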

3. The method of claim 2 wherein the feature set further comprises a direct theme feature, the direct theme feature 
having a first value indicating that the selected sentence represents a theme of the selected document, the direct 
theme feature having a second value indicating that the selected sentence does not represent a theme of the 
selected document. 

4. The method of claim 2 or 3 wherein the feature set further comprises a cue word feature, the cue word feature 
having a first value indicating that the selected sentence summarizes the selected document, the cue word feature 
having a second value indicating that the selected sentence does not summarize the selected document.

5. The method of claim 2, 3 or 4, wherein the feature set further comprises a sentence length feature, the sentence
length feature having a first value indicating that the selected sentence exceeds a minimum length, and the sen- 
tence length feature having a second value indicating that the selected sentence does not exceed the minimum 
length. 

6. An article of manufacture comprising: 

a) a memory; and

b) data stored by the memory, the data stored being accessible for automatically generating feature probabilities
from a document corpus and a summary corpus of manually generated summaries, each document of
the document corpus being associated with a summary of the summary corpus, each document including a
multiplicity of sentences, the multiplicity of sentences including a plurality of matching sentences, each matching
sentence matching a sentence of the associated summary, the method comprising the steps of:

1) designating as a selected document a document of the document corpus;

2) designating as a selected sentence a one of the sentences of the selected document;

3) determining values for the selected sentence of each feature of a feature set, the feature set including a
location feature and an upper case feature, the location feature having a first location value, a second location
value, and a third location value, the first location value indicating that the selected sentence is included within
a beginning portion of the selected document, the second location value indicating that the selected sentence
is included within a middle portion of the selected document, and the third location value indicating that the
selected sentence is included within an ending portion of the selected document, each value of the location
feature having an associated total counter, and an associated matching counter, the upper case feature having
a first upper case value and a second upper case value, the first upper case value indicating that selected
sentence does not include any of a multiplicity of selected upper case phrases, the second upper case value
indicating the selected sentence includes a one of the selected upper case phrases, each value of the upper
case feature having an associated total counter and an associated matching counter;

4) for each feature incrementing the total counter associated with the feature value for the selected sentence;

5) if the selected sentence is a one of the plurality of matching sentences, for each feature incrementing the
matching counter associated with the feature value for the selected sentence;

6) if all sentences of the selected document have not been designated as the selected sentence, repeating [illegible]

7) if all documents of the document corpus have not been designated as the selected document, repeating [illegible]

8) for each value of [illegible]
matching counter.

7. A programmable text processing apparatus when suitably programmed for carrying out the method of any of claims
1 to 5, the apparatus including a processor, a memory, input/output circuitry and an optional user interface.

FIG. 1